Scientific Python antipatterns advent calendar day five

For today, another example that’s to do with performance. As a reminder, I’ll post one tiny example per day with the intention that they should only take a couple of minutes to read.

If you want to read them all but can’t be bothered checking this website each day, sign up for the mailing list:

Sign up for the mailing list

and I’ll send a single email at the end with links to them all.

String concatenation inside a loop

One of the first tools that I teach complete beginners in Python is + to concatenate two strings:

'Hello ' + 'world'
'Hello world'

so when we need to construct a string in many steps it’s tempting to just do it with a loop. Imagine we have a list of words:

import random

# randomly pick 1000 fruit names
fruits_small = []
for i in range(1_000):
    fruits_small.append(random.choice(['apple', 'banana', 'orange', 'strawberry']))

# show the first ten
fruits_small[:10]
['orange',
 'orange',
 'banana',
 'strawberry',
 'orange',
 'orange',
 'orange',
 'strawberry',
 'apple',
 'orange']

and we want to generate a comma-separated list. The obvious code is something like this:

string = ''
for fruit in fruits_small:
    string = string + fruit + ','

# just show the first 100 characters
string[:100]
'orange,orange,banana,strawberry,orange,orange,orange,strawberry,apple,orange,banana,orange,strawberr'

As we can see from the output, this works fine. But if we do some timing experiments, we will notice something unsettling. With 1000 words the loop takes around 190 microseconds on my laptop:

%%timeit
string = ''
for fruit in fruits_small:
    string = string + fruit + ','
196 μs ± 9.02 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

What will happen if we make the number of words ten times bigger?

fruits_large = []
for i in range(10_000):
    fruits_large.append(random.choice(['apple', 'banana', 'orange', 'strawberry']))

We might expect that the execution time would also increase by a factor of ten, to around 1.9 milliseconds. However, the actual timing:

%%timeit
string = ''
for fruit in fruits_large:
    string = string + fruit + ','
16.4 ms ± 306 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shows that the execution time goes up by a factor closer to 100. This is because strings are immutable in Python; every time we do a concatenation, the computer has to create a whole new copy of the original string in memory. As the string gets longer, this copying takes more and more time, so the total work grows roughly quadratically with the number of words.
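We can see where the factor of 100 comes from with a little back-of-the-envelope counting (this is my own illustration, with a made-up average word length of 7 characters): each time round the loop copies the whole string built so far, so the total number of characters copied is roughly proportional to the square of the number of words.

```python
def chars_copied(n, word_length):
    """Roughly count the characters copied when concatenating
    n words one at a time (ignoring the commas)."""
    total = 0
    length = 0
    for _ in range(n):
        length += word_length  # the string grows by one word...
        total += length        # ...and the whole thing is copied again
    return total

small = chars_copied(1_000, 7)
large = chars_copied(10_000, 7)
print(large / small)  # ten times the words, roughly 100 times the copying
```

This matches the timing experiment: making the list ten times bigger makes the copying about a hundred times bigger.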

The fix is to use the join method. If we go back to our small list and try it:

string = ','.join(fruits_small)
string[:100]
'orange,orange,banana,strawberry,orange,orange,orange,strawberry,apple,orange,banana,orange,strawberr'

we get the same output as before. But if we add the timing code:

%timeit string = ','.join(fruits_small)
%timeit string = ','.join(fruits_large)
8.85 μs ± 312 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
84.3 μs ± 1.37 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

we see that this solution is not only much faster for both lists, but has much more reasonable scaling behaviour - the large list, which is ten times larger, takes almost exactly ten times longer to join.
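One situation worth mentioning: our fruits were already sitting in a list, but sometimes the pieces are generated one at a time inside a loop. The same fix still applies (a small sketch of the pattern, with a hypothetical three-item example): append each piece to a list, then call join once at the end.

```python
# Collect the pieces in a list as they are produced...
parts = []
for fruit in ['apple', 'banana', 'orange']:
    parts.append(fruit)

# ...then do a single join at the end, rather than
# concatenating inside the loop
result = ','.join(parts)
print(result)  # apple,banana,orange
```

Appending to a list is cheap because lists are mutable, so we avoid the repeated copying entirely.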

One more time; if you want to see the rest of these little write-ups, sign up for the mailing list:

Sign up for the mailing list